 dynamic scene


4D3R

Neural Information Processing Systems

Novel view synthesis from monocular videos of dynamic scenes with unknown camera poses remains a fundamental challenge in computer vision and graphics. While recent advances in 3D representations such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have shown promising results for static scenes, they struggle with dynamic content and typically rely on pre-computed camera poses. We present 4D3R, a pose-free dynamic neural rendering framework that decouples static and dynamic components through a two-stage approach. Our method first leverages 3D foundational models for initial pose and …


Track3R: Joint Point Map and Trajectory Prior for Spatiotemporal 3D Understanding

Neural Information Processing Systems

Understanding the 3D world from 2D monocular videos is a crucial ability for AI. Recently, to tackle this underdetermined task, end-to-end 3D geometry priors have been sought after, such as pre-trained point map models at scale. These models enable robust 3D understanding from casually taken videos, providing accurate object shapes disentangled from uncertain camera parameters. However, they still struggle when affected by object deformation and dynamics, failing to establish consistent correspondence over the frames. Furthermore, their architectures are typically limited to pairwise frame processing, which is insufficient for capturing complex motion dynamics over extended sequences. To address these limitations, we introduce Track3R, a novel framework that integrates a new architecture and task to jointly predict point map and motion trajectories across multiple frames from video input. Specifically, our key idea is modeling two disentangled trajectories for each point: one representing object motion and the other camera poses. This design not only can enable understanding of the 3D object dynamics, but also facilitates the learning of more robust priors for 3D shapes in dynamic scenes. In our experiments, Track3R demonstrates significant improvements in a joint point mapping and 3D motion estimation task for dynamic scenes, such as 25.8% improvements in the motion estimation, and 15.7% in the point mapping accuracy.
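As a rough illustration of the two disentangled trajectories, the sketch below composes a per-point object displacement with a per-frame camera pose. The function names, the additive object-motion model, and the world-to-camera convention are all assumptions made for illustration; the abstract does not say how Track3R parameterizes either trajectory.

```python
from typing import List, Sequence

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]


def apply_pose(rotation: Mat3, translation: Vec3, point: Vec3) -> List[float]:
    """Map a world-space point into a camera frame: R @ p + t."""
    return [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]


def point_in_frame(canonical: Vec3, object_offset: Vec3,
                   rotation: Mat3, translation: Vec3) -> List[float]:
    """Compose the two trajectories for one point at one time step.

    object_offset: hypothetical object-motion trajectory sample (world space).
    rotation/translation: hypothetical camera-pose trajectory sample.
    """
    moved = [c + d for c, d in zip(canonical, object_offset)]
    return apply_pose(rotation, translation, moved)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Static point, moving camera: only the camera trajectory changes the result.
static_seen = point_in_frame([1.0, 2.0, 3.0], [0.0, 0.0, 0.0],
                             IDENTITY, [0.5, 0.0, 0.0])
# Moving point, fixed camera: only the object trajectory changes the result.
moving_seen = point_in_frame([1.0, 2.0, 3.0], [0.5, 0.0, 0.0],
                             IDENTITY, [0.0, 0.0, 0.0])
```

In this toy setup both cases produce the same camera-frame position, which shows why a single frame cannot tell the two sources of motion apart and why they have to be modeled separately across frames.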


Flux4D: Flow-based Unsupervised 4D Reconstruction

Neural Information Processing Systems

Reconstructing large-scale dynamic scenes from visual observations is a fundamental challenge in computer vision. While recent differentiable rendering methods such as NeRF and 3DGS have achieved impressive photorealistic reconstruction, they suffer from scalability limitations and require annotations to decouple moving actors from the static scene, such as in autonomous driving scenarios. Existing self-supervised methods attempt to eliminate explicit annotations by leveraging motion cues and geometric priors, yet they remain constrained by per-scene optimization and sensitivity to hyperparameter tuning. In this paper, we introduce Flux4D, a simple and scalable framework for 4D reconstruction of large-scale dynamic driving scenes. Flux4D directly predicts 3D Gaussians and their motion dynamics to reconstruct sensor observations in a fully unsupervised manner. By adopting only photometric losses and enforcing an "as static as possible" regularization, Flux4D learns to decompose dynamic elements directly from raw data, without requiring pre-trained supervised models or foundational priors, simply by training across many scenes. Our approach enables efficient reconstruction of dynamic scenes within seconds, scales effectively to large datasets, and generalizes well to unseen environments, including rare and unknown objects. Experiments on outdoor driving datasets show Flux4D significantly outperforms existing methods in scalability, generalization, and reconstruction quality.
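One way to picture the combination of a photometric loss with an "as static as possible" regularizer is the minimal sketch below. The L1 photometric term, the mean velocity-magnitude penalty, and the `static_weight` parameter are assumptions chosen for illustration; the abstract does not give the exact form of either term.

```python
import math
from typing import Sequence


def photometric_l1(rendered: Sequence[float], observed: Sequence[float]) -> float:
    """Mean absolute error between rendered and observed pixel intensities."""
    return sum(abs(r - o) for r, o in zip(rendered, observed)) / len(observed)


def static_prior(velocities: Sequence[Sequence[float]]) -> float:
    """Mean velocity magnitude; zero when every Gaussian is static."""
    norms = [math.sqrt(sum(v * v for v in vel)) for vel in velocities]
    return sum(norms) / len(norms)


def total_loss(rendered: Sequence[float], observed: Sequence[float],
               velocities: Sequence[Sequence[float]],
               static_weight: float = 0.1) -> float:
    """Photometric term plus a penalty that prefers static explanations."""
    return (photometric_l1(rendered, observed)
            + static_weight * static_prior(velocities))


# Two pixels and two Gaussians: one static, one moving with speed 5.
loss = total_loss(
    rendered=[0.5, 0.5],
    observed=[0.5, 1.0],
    velocities=[[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]],
    static_weight=0.1,
)
```

Under this formulation a Gaussian is only assigned motion when doing so lowers the photometric error by more than the penalty it incurs, which is one plausible reading of how dynamic elements could separate from the static scene without labels.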


4DGCPro: Efficient Hierarchical 4D Gaussian Compression for Progressive Volumetric Video Streaming

Neural Information Processing Systems

Achieving seamless viewing of high-fidelity volumetric video, comparable to 2D video experiences, remains an open challenge. Existing volumetric video compression methods either lack the flexibility to adjust quality and bitrate within a single model for efficient streaming across diverse networks and devices, or struggle with real-time decoding and rendering on lightweight mobile platforms. To address these challenges, we introduce 4DGCPro, a novel hierarchical 4D Gaussian compression framework that facilitates real-time mobile decoding and high-quality rendering via progressive volumetric video streaming in a single bitstream. Specifically, we propose a perceptually-weighted and compression-friendly hierarchical 4D Gaussian representation with motion-aware adaptive grouping to reduce temporal redundancy, preserve coherence, and enable scalable multi-level detail streaming. Furthermore, we present an end-to-end entropy-optimized training scheme, which incorporates layer-wise rate-distortion (RD) supervision and attribute-specific entropy modeling for efficient bitstream generation. Extensive experiments show that 4DGCPro enables flexible quality and multiple bitrates within a single model, achieving real-time decoding and rendering on mobile devices while outperforming existing methods in RD performance across multiple datasets. The corresponding author is Qiang Hu (qiang.hu@sjtu.edu.cn).
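The idea of layer-wise rate-distortion supervision over a progressive bitstream can be sketched as follows. This is a generic illustration, not the training objective from the paper: the per-layer sum `D_l + lam * R_l`, the single shared multiplier `lam`, and the budget-based layer selection are all assumptions.

```python
from typing import Sequence


def layerwise_rd_loss(distortions: Sequence[float], rates: Sequence[float],
                      lam: float) -> float:
    """Sum of D_l + lam * R_l over layers, so every layer is supervised."""
    return sum(d + lam * r for d, r in zip(distortions, rates))


def layers_within_budget(layer_bits: Sequence[int], budget_bits: int) -> int:
    """Count how many leading layers of a progressive stream fit the budget."""
    used = 0
    count = 0
    for bits in layer_bits:
        if used + bits > budget_bits:
            break  # later layers depend on earlier ones, so stop here
        used += bits
        count += 1
    return count


# Three layers: distortion falls and rate grows as layers are added.
loss = layerwise_rd_loss(distortions=[4.0, 2.0, 1.0],
                         rates=[10.0, 20.0, 40.0], lam=0.5)
decodable = layers_within_budget(layer_bits=[100, 200, 400], budget_bits=350)
```

Supervising every layer, rather than only the full-quality reconstruction, is what would let a client stop decoding after any prefix of the stream and still obtain a usable result.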


Holistic Gaussian Splatting for Embodied View Synthesis

Neural Information Processing Systems

We propose HoliGS, a novel deformable Gaussian splatting framework that addresses embodied view synthesis from long monocular RGB videos. Unlike prior 4D Gaussian splatting and dynamic NeRF pipelines, which struggle with training overhead in minute-long captures, our method leverages invertible Gaussian Splatting deformation networks to reconstruct large-scale, dynamic environments accurately. Specifically, we decompose each scene into a static background plus time-varying objects, each represented by learned Gaussian primitives undergoing global rigid transformations, skeleton-driven articulation, and subtle non-rigid deformations via an invertible neural flow. This hierarchical warping strategy attaches Gaussians to a complete canonical foreground shape, enabling robust free-viewpoint novel-view rendering from various embodied camera trajectories (e.g., egocentric or third-person follow), which may involve substantial viewpoint changes and interactions between multiple actors. Our experiments demonstrate that HoliGS achieves superior reconstruction quality on challenging datasets while significantly reducing both training and rendering time compared to state-of-the-art monocular deformable NeRFs.
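To make the notion of an invertible hierarchical warp concrete, the sketch below chains a global rigid transform with a small non-rigid step and then undoes both exactly. It is a toy under stated assumptions: the z-axis rotation, the additive coupling function, and all names are hypothetical, and the skeleton-driven articulation stage is omitted for brevity.

```python
import math
from typing import List, Sequence


def rigid(point: Sequence[float], angle: float,
          shift: Sequence[float]) -> List[float]:
    """Rotate about the z-axis by `angle`, then translate by `shift`."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return [c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2]]


def rigid_inverse(point: Sequence[float], angle: float,
                  shift: Sequence[float]) -> List[float]:
    """Undo `rigid`: subtract the shift, then rotate by -angle."""
    x, y, z = (p - t for p, t in zip(point, shift))
    c, s = math.cos(angle), math.sin(angle)
    return [c * x + s * y, -s * x + c * y, z]


def coupling(point: Sequence[float], scale: float = 0.1) -> List[float]:
    """Additive coupling: offset z by a function of (x, y) only."""
    x, y, z = point
    return [x, y, z + scale * math.sin(x + y)]


def coupling_inverse(point: Sequence[float], scale: float = 0.1) -> List[float]:
    """Exact inverse: x and y are unchanged, so the offset can be recomputed."""
    x, y, z = point
    return [x, y, z - scale * math.sin(x + y)]


def warp(canonical, angle, shift):
    """Canonical space to observed space: rigid first, then non-rigid."""
    return coupling(rigid(canonical, angle, shift))


def unwarp(observed, angle, shift):
    """Observed space back to canonical space, inverting in reverse order."""
    return rigid_inverse(coupling_inverse(observed), angle, shift)


canonical = [1.0, 2.0, 3.0]
observed = warp(canonical, angle=0.3, shift=[0.5, -1.0, 2.0])
recovered = unwarp(observed, angle=0.3, shift=[0.5, -1.0, 2.0])
```

Because each stage has a closed-form inverse, a point can be carried between the canonical shape and any frame without iterative solving, which is the property an invertible flow is meant to provide.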


Dynamic Gaussian Splatting from Defocused and Motion-blurred Monocular Videos

Neural Information Processing Systems

This paper presents a unified framework that allows high-quality dynamic Gaussian Splatting from both defocused and motion-blurred monocular videos. Because the formation processes of defocus blur and motion blur differ significantly, existing methods are tailored to one or the other and cannot handle both simultaneously. Although the two can be jointly modeled as blur kernel-based convolution, the inherent difficulty of estimating accurate blur kernels greatly limits progress in this direction. In this work, we take a step further in this direction. In particular, we propose to estimate reliable per-pixel blur kernels using a blur prediction network that exploits blur-related scene and camera information and is subject to a blur-aware sparsity constraint. In addition, we introduce a dynamic Gaussian densification strategy to mitigate the lack of Gaussians in incomplete regions, and boost the performance of novel view synthesis by incorporating unseen view information to constrain scene optimization. Extensive experiments show that our method outperforms state-of-the-art methods in photorealistic novel view synthesis from defocused and motion-blurred monocular videos.
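Modeling blur as a kernel-based convolution with one kernel per pixel can be illustrated in one dimension as below. The clamped edge handling, the fixed three-tap kernels, and the off-center-mass penalty standing in for a sparsity constraint are assumptions for illustration; the abstract does not define the actual constraint or how the kernels are predicted.

```python
from typing import List, Sequence


def blur_per_pixel(sharp: Sequence[float],
                   kernels: Sequence[Sequence[float]]) -> List[float]:
    """Apply a separate 1D kernel at each pixel (edges are clamped)."""
    n = len(sharp)
    out = []
    for i, kernel in enumerate(kernels):
        radius = len(kernel) // 2
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp to a valid index
            acc += weight * sharp[j]
        out.append(acc)
    return out


def off_center_mass(kernels: Sequence[Sequence[float]]) -> float:
    """Hypothetical sparsity penalty: per-pixel kernel mass away from the
    center tap, averaged over pixels. Zero when no pixel is blurred."""
    total = 0.0
    for kernel in kernels:
        center = len(kernel) // 2
        total += sum(abs(w) for k, w in enumerate(kernel) if k != center)
    return total / len(kernels)


sharp = [0.0, 0.0, 1.0, 0.0, 0.0]
identity = [0.0, 1.0, 0.0]   # sharp pixel: all weight on the center tap
box = [0.25, 0.5, 0.25]      # blurred pixel: weight spread to neighbours
kernels = [identity, identity, box, box, identity]

blurred = blur_per_pixel(sharp, kernels)
penalty = off_center_mass(kernels)
```

A single formulation of this kind covers both blur types in principle, since defocus and motion blur differ only in the shape of the kernel assigned to each pixel.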


Reconstruct, Inpaint, Test-Time Finetune: Dynamic Novel-view Synthesis from Monocular Videos

Neural Information Processing Systems

We explore novel-view synthesis for dynamic scenes from monocular videos. Prior approaches rely on costly test-time optimization of 4D representations or do not preserve scene geometry when trained in a feed-forward manner. Our approach is based on three key insights: (1) covisible pixels (those visible in both the input and target views) can be rendered by first reconstructing the dynamic 3D scene and then rendering the reconstruction from the novel views, and (2) hidden pixels in novel views can be "inpainted" with feed-forward 2D video diffusion models. Notably, our video inpainting diffusion model (CogNVS) can be self-supervised from 2D videos, allowing us to train it on a large corpus of in-the-wild videos. This in turn allows for (3) CogNVS to be applied zero-shot to novel test videos via test-time finetuning. We empirically verify that CogNVS outperforms almost all prior art for novel-view synthesis of dynamic scenes from monocular videos.
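The split between covisible and hidden pixels can be pictured with the simplified sketch below, which keeps rendered values where they exist and falls back to inpainted values elsewhere. The per-pixel compositing and the use of `None` to mark hidden pixels are assumptions for illustration; the abstract does not say whether CogNVS composites in this way or regenerates the whole frame conditioned on the partial render.

```python
from typing import List, Optional, Sequence


def composite_novel_view(rendered: Sequence[Optional[float]],
                         inpainted: Sequence[float]) -> List[float]:
    """Keep rendered values where covisible; fall back to inpainted values.

    A value of None in `rendered` marks a pixel that was hidden in the
    input view and so has no reconstructed geometry behind it.
    """
    return [inp if ren is None else ren
            for ren, inp in zip(rendered, inpainted)]


rendered = [0.2, None, 0.8, None]    # None = hidden pixel in the novel view
inpainted = [0.25, 0.5, 0.75, 0.9]   # stand-in for a diffusion model output
novel_view = composite_novel_view(rendered, inpainted)
```

The point of the split is that geometry is preserved wherever the reconstruction covers the target view, and the generative model is only relied on for regions the input video never observed.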


2b0e14abd8128e6bf98b6b0bec1cfcbf-Paper-Conference.pdf

Neural Information Processing Systems

… Gaussian representation, achieving a 30× speed-up and reducing … within two minutes, while maintaining competitive performance … Our method reconstructs a single video within 10 minutes on the Dycheck dataset or for a typical 200-frame … apply our model to in-the-wild videos, showcasing its generalizability. … website is published at https://instant4d.github.io/.


Spike4DGS: Towards High-Speed Dynamic Scene Reconstruction with 4D Gaussian Splatting via a Spike Camera Array

Neural Information Processing Systems

Spike cameras, with their high temporal resolution, offer a new perspective on high-speed dynamic scene rendering. Most existing rendering methods rely on Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS) for static scenes using a monocular spike camera. However, these methods struggle with dynamic motion, while a single camera suffers from limited spatial coverage, making it challenging to reconstruct fine details in high-speed scenes. To address these problems, we propose Spike4DGS, the first high-speed dynamic scene rendering framework with 4D Gaussian Splatting using spike camera arrays. Technically, we first build a multi-view spike camera array to validate our solution, then establish both synthetic and real-world multi-view spike-based reconstruction datasets. Next, we design a multi-view spike-based dense initialization module that obtains dense point clouds and camera poses from continuous spike streams. Finally, we propose a spike-pixel synergy constraint supervision to optimize Spike4DGS, incorporating both a rendered image quality loss and a dynamic spatiotemporal spike loss. The results show that Spike4DGS outperforms state-of-the-art methods in terms of novel view rendering quality on both synthetic and real-world datasets. More details are available at the project page.

